Reinforcement learning

Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take ''actions'' in an ''environment'' so as to maximize some notion of cumulative ''reward''. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called ''approximate dynamic programming''. The problem has been studied in the theory of optimal control, though most studies are concerned with the existence of optimal solutions and their characterization, and not with the learning or approximation aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.
In machine learning, the environment is typically formulated as a Markov decision process (MDP), because many reinforcement learning algorithms for this setting utilize dynamic programming techniques. The main difference between classical dynamic programming methods and reinforcement learning algorithms is that the latter do not require an explicit model of the MDP, and they target large MDPs where exact methods become infeasible.
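For contrast, here is a minimal sketch of classical value iteration on a made-up two-state MDP; all transition probabilities and rewards below are illustrative assumptions. The point is that such a method needs the full model of the MDP, which reinforcement learning algorithms dispense with.
<syntaxhighlight lang="python">
# Classical value iteration on a made-up two-state MDP (illustrative numbers).
# It requires the full model: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9                       # discount factor (assumed)
V = {s: 0.0 for s in P}           # value estimates, initialised to zero

for _ in range(200):              # repeated full sweeps over all states
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}

print(V)                          # approximately optimal state values
</syntaxhighlight>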
Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.
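A minimal sketch of the exploration vs. exploitation trade-off on a multi-armed bandit, assuming an epsilon-greedy rule with Gaussian arm rewards; the arm means and the value of epsilon are illustrative choices, not taken from the article.
<syntaxhighlight lang="python">
import random

true_means = [0.2, 0.5, 0.8]      # unknown to the agent (illustrative values)
estimates = [0.0, 0.0, 0.0]       # running estimate of each arm's mean reward
counts = [0, 0, 0]
epsilon = 0.1                     # probability of exploring a random arm

for t in range(10000):
    if random.random() < epsilon:                   # explore uncharted territory
        arm = random.randrange(len(true_means))
    else:                                           # exploit current knowledge
        arm = max(range(len(estimates)), key=lambda a: estimates[a])
    reward = random.gauss(true_means[arm], 1.0)     # noisy reward from that arm
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # incremental mean

print(estimates)                  # the best arm's estimate approaches 0.8
</syntaxhighlight>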
== Introduction ==

The basic reinforcement learning model consists of:
# a set of environment states S;
# a set of actions A;
# rules of transitioning between states;
# rules that determine the ''scalar immediate reward'' of a transition; and
# rules that describe what the agent observes.
The rules are often stochastic. The observation typically involves the scalar immediate reward associated with the last transition.
In many works, the agent is also assumed to observe the current environmental state, in which case we talk about ''full observability'', whereas in the opposing case we talk about ''partial observability''. Sometimes the set of actions available to the agent is restricted (e.g., the agent cannot spend more money than it possesses).
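As one possible illustration of these five components, the following hypothetical Python sketch encodes a tiny, fully observable, stochastic environment; the state names, actions, transition probabilities and rewards are placeholders.
<syntaxhighlight lang="python">
import random

# Components 1 and 2: finite sets of states and actions (placeholder names).
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]

def step(state, action):
    """Components 3 and 4: a stochastic transition rule and the scalar
    immediate reward attached to the resulting transition."""
    if action == "work":
        next_state = "high" if random.random() < 0.7 else "low"
    else:
        next_state = state
    reward = 1.0 if next_state == "high" else 0.0
    return next_state, reward

def observe(state, reward):
    """Component 5: under full observability the agent sees the current
    state together with the last immediate reward."""
    return state, reward
</syntaxhighlight>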
A reinforcement learning agent interacts with its environment in discrete time steps.
At each time t, the agent receives an observation o_t, which typically includes the reward r_t.
It then chooses an action a_t from the set of actions available, which is subsequently sent to the environment.
The environment then moves to a new state s_{t+1}, and the reward r_{t+1} associated with the ''transition'' (s_t, a_t, s_{t+1}) is determined.
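Continuing the hypothetical environment sketched above, the discrete-time interaction loop might look as follows; the uniformly random choice of action is merely a stand-in for a real learning agent's policy.
<syntaxhighlight lang="python">
state = "low"                     # initial environment state
last_reward = 0.0
total_reward = 0.0

for t in range(100):
    o_t = observe(state, last_reward)       # observation o_t includes the reward r_t
    a_t = random.choice(ACTIONS)            # placeholder policy chooses an action a_t
    state, last_reward = step(state, a_t)   # environment moves to s_{t+1}, emits r_{t+1}
    total_reward += last_reward

print(total_reward)
</syntaxhighlight>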
The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can choose any action as a function of the history and it can even randomize its action selection.
When the agent's performance is compared to that of an agent which acts optimally from the beginning, the difference in performance gives rise to the notion of ''regret''.
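As a small numeric illustration (with made-up reward sequences), regret is simply the gap between the cumulative reward of an optimal agent and that of the learner.
<syntaxhighlight lang="python">
# Made-up reward sequences over ten steps; the learner wastes two early steps.
optimal_rewards = [1.0] * 10
agent_rewards = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
regret = sum(optimal_rewards) - sum(agent_rewards)
print(regret)                     # 2.0: the cumulative cost of acting sub-optimally
</syntaxhighlight>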
Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions: for example, an agent may accept the negative immediate reward of paying for schooling now in order to maximize its future income.
Thus, reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon and checkers (Sutton and Barto 1998, Chapter 11).
Two components make reinforcement learning powerful: the use of samples to optimize performance, and the use of function approximation to deal with large environments.
Thanks to these two key components, reinforcement learning can be used in large environments in any of the following situations:
* A model of the environment is known, but an analytic solution is not available;
* Only a simulation model of the environment is given (the subject of simulation-based optimization);
* The only way to collect information about the environment is by interacting with it.
The first two of these problems could be considered planning problems (since some form of the model is available), while the last one could be considered as a genuine learning problem. However, under a reinforcement learning methodology both planning problems would be converted to machine learning problems.
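To make these two components concrete, the following sketch (reusing the hypothetical environment above) estimates state values from sampled transitions only, with a linear function approximator updated by a temporal-difference rule; the feature map, learning rate and discount factor are assumptions chosen for illustration, and this is just one standard example rather than a method prescribed by this section.
<syntaxhighlight lang="python">
import random

def features(state):
    """Tiny hand-crafted feature vector (an assumption made for illustration)."""
    return [1.0, 1.0 if state == "high" else 0.0]

w = [0.0, 0.0]                    # weights of the linear estimate V(s) ~ w . phi(s)
alpha, gamma = 0.05, 0.9          # learning rate and discount factor (assumed)

for episode in range(2000):
    state = random.choice(STATES)
    for t in range(20):
        action = random.choice(ACTIONS)            # sample behaviour, no model needed
        next_state, reward = step(state, action)   # one sampled transition
        phi = features(state)
        phi_next = features(next_state)
        v = sum(wi * fi for wi, fi in zip(w, phi))
        v_next = sum(wi * fi for wi, fi in zip(w, phi_next))
        td_error = reward + gamma * v_next - v     # temporal-difference error
        w = [wi + alpha * td_error * fi for wi, fi in zip(w, phi)]
        state = next_state

print(w)                          # V("low") = w[0], V("high") = w[0] + w[1]
</syntaxhighlight>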
